This weeks data is part of the UN Development Programme, Human Development Reports, which you can find here The data is combined from two different data sets: Human Development Index and its components and Gender Inequality Index
At first I will load the ‘human’ data and then explore it´s structure and dimensions.
setwd("/Users/mirva/IODS-project/data")
human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", sep=",", header=TRUE)
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155 8
As you can see there are 155 observations and 8 variables in this data. The variables are as follows: * Edu2.FM = The ratio of female and male populations with secondary education in each country (female/ male) * Labo.FM = The ratio of labour force participation of females and males in each country (female/male) * Life.Exp = Life expectancy at birth (years) * Edu.Exp = Expected years of schooling * GNI = Gross national income (GNI) per capita * Mat.Mor = Maternal mortality ratio (deaths per 100 000 live births) * Ado.Birth = Adolescent birth rate (births per 1 000 women ages 15–19) * Parli.F = Share of seats in parliament (% held by women)
Now that you know what the variables used here are let´s look at the graphics and summaries of the data.
library(GGally); library(dplyr); library(corrplot)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
ggpairs(human)
cor(human) %>% corrplot()
As you can see from the correlation plots above some correlations are quite strong and some are very minimal. Strongest correlations we can find are between Mat.Mor and Life.Exp (-.857), Edu.Exp and Life.Exp (r=.789), Mat.Mor and Ado.Birth (r=.759), Life.Exp and GNI (r=.627) and Edu.Exp and GNI (r=.624). Weakest correlations are between Edu2.FM and Labo.FM (r=.00956), Labo.FM and GNI (r=-.0217), Labo.FM and Edu.Exp (r=.0473), Ado.Birth and Parli.F (r=-.0709) and Parli.F and Edu2.FM (r=.0786).
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
As we can see from the summaries and also from the upper of the correlation plots above the median of the Edu2.FM is around 0.9375 so the ratio of female and male populations with secondary education is almost 1 meaning almost equal ratio of education. But in some country the minimum is 0.1717 meaning that the women get 2nd education rarely compared to men. The maximum here is 1.4967 meaning that in some countries women are more educated than men.
If we look at the summary of the Labo.FM median is 0.7535 so the men are in labour force more often than women. In many countries women are at home taking care of household and it might explain something of this result. It is interesting that the maximum here is only 1.0380 which tells that women in average don´t work more often than men.
Edu.Exp seems to be quite normally distributed and the median here is 13.50 (expected years of schooling). The 1st and 3rd quartiles are quite near this but the minimum is only 5.40 telling that some of people won´t probably go to school at all or just for few years. Maximum 20.20 years is also quite high meaning that in some countries almost everyone gets at least 2nd education and even higher.
The median of Life.Exp is 74.20 years. Quite shockingly the minimum here is only 49 years which might be a consequense for example of high child mortality. The maximum is 83.50 years so in some country the life expectancy is almost two times larger than in the country where it is 49 years.
As we can see in GNI the range is huge. The minimum is only 581$ and the maximum 123 124$ while the median is 12 040$. We can see from the graph also that GNI is far from uniform (or even normal) distribution and therefore is quite inequal.
Almost the same distribution is true with the Mat.Mor but here the minimum is 1 death, median 49 deaths and maximum 1100 deaths and 50% is between 11.5 and 190 deaths. So this means that in some country there is only 1 death per 100 000 life births (0.001%) and in some country 1100 deaths (1.1%) so the differences are very big but in most countries the death rate is quite low.
The median of the Ado.Birth is surplisingly high, 33.60 which is 3.36% of womean ages 15-19. Also the range is quite big; minimum is 0.60 and maximum 204.80.
Parli.F tells much of the equality between the genders. The minimum here is 0 meaning that in some country there are no women in parliament. The maximum is 57.50 so in some country there are more women in parliament than men. The median is 19.30 so in most countries women are under represented in parliaments compared to men.
Because there are 8 variables it would be hard to interpret the relations between them if we could use only pairwise comparisons. That´s why I will use principal component analysis (PCA) which reduces variables to combinations of the data. I will perform the PCA first on the non standardized human data and then on the standardized human data
pca1_human <- prcomp(human)
s1 <- summary(pca1_human)
s1
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
pca1_pr <- round(100*s1$importance[2,], digits = 1)
pca1_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
pc1_lab <- paste0(names(pca1_pr), " (", pca1_pr, "%)")
biplot(pca1_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc1_lab[1], ylab = pc1_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
human_std <- scale(human)
summary(human_std)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
pca2_human <- prcomp(human_std)
s2 <- summary(pca2_human)
s2
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
pca2_pr <- round(100*s2$importance[2,], digits = 1)
pca2_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
pc2_lab <- paste0(names(pca2_pr), " (", pca2_pr, "%)")
biplot(pca2_human, cex = c(0.8, 0.7), col = c("grey40", "deeppink2"), xlab = pc2_lab[1], ylab = pc2_lab[2])
As you can see the plots created are very different. If we look at first at the PCA on non standardized data we can see that the first component PC1 explains 99,99% of the variation and PC2 0,01% so as we can see from the cumulative proportion together they explain 100% of the variation. Because there are different units of measure in the data variances are quite different and the PCA is really hard to interpret on non standardized data.
If we then look at the PCA on standardized data we can see that there are more components with some proportions and 100% is acquired not until PC8. PC1 and PC2 together explain 69,84% of the variaton. Also the plot is now easier to interpret. If we look at the arrows we can see that there are quite strong correlations between Edu.Exp, Life.Exp, Edu2.FM and GNI and also between Mat.Mor and Ado.Birth as we already noticed earlier. These 6 variables also correlate strongly with PC1. Parli.F and Labo.FM correlate most with each other and PC2. We can also see that for example in Mozambique and Burundi the ratio of labour force participation of females and males in each country and share of seats in parliament are quite high and also the maternal mortality ratio and adolescent birth ratio. In Mozambique and Burundi the ratio of female and male populations with secondary education, expected years of schooling, life expectancy at birth and gross national income (GNI) per capita are quite low. For example Iran has quite opposite pattern.
At first I will load the tea dataset from the FactoMineR package
library(FactoMineR); library(tidyr); library(ggplot2)
data("tea")
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
There are 300 observations and 36 variables in this data.
I choose to keep the following columns: tea.time, tearoom, Tea, How, sugar, how and where.
keep_columns <- c("tea.time", "tearoom", "Tea", "How", "sugar", "how", "where")
tea_time <- dplyr::select(tea, one_of(keep_columns))
summary(tea_time)
## tea.time tearoom Tea How
## Not.tea time:131 Not.tearoom:242 black : 74 alone:195
## tea time :169 tearoom : 58 Earl Grey:193 lemon: 33
## green : 33 milk : 63
## other: 9
## sugar how where
## No.sugar:155 tea bag :170 chain store :192
## sugar :145 tea bag+unpackaged: 94 chain store+tea shop: 78
## unpackaged : 36 tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 7 variables:
## $ tea.time: Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
As you can see from the summary there are the following options in the chosen variables: * tea.time: not tea time or tea time * tearoom: not in a tearoom or in a tearoom * Tea: black, earl grey or green * How: alone, with lemon, milk or other * sugar: no sugar or with sugar * how: with a tea bag, with a tea bag and loose leaf (unpackaged) or only loose leaf * where: from a chain store, from a chain store and a tea shop or only from a tea shop
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables; they will
## be dropped
From the barplots we can see that most of the people have tea time, don´t drink at tearoom, drink earl grey, drink tea alone without any lemon or milk, use sugar, use only tea bags and buy their tea from a chain store.
I will use Multiple Correspondence Analysis (MCA) to inspect the connections between the variables I chose
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.280 0.233 0.185 0.160 0.148 0.141
## % of var. 16.335 13.580 10.802 9.327 8.633 8.213
## Cumulative % of var. 16.335 29.915 40.717 50.044 58.677 66.890
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
## Variance 0.131 0.122 0.101 0.089 0.072 0.053
## % of var. 7.635 7.100 5.880 5.205 4.191 3.099
## Cumulative % of var. 74.525 81.625 87.505 92.711 96.901 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | -0.525 0.328 0.256 | 0.021 0.001 0.000 | -0.242 0.105
## 2 | -0.353 0.148 0.082 | -0.076 0.008 0.004 | -0.622 0.696
## 3 | -0.295 0.104 0.140 | -0.148 0.031 0.035 | -0.284 0.145
## 4 | -0.680 0.551 0.645 | -0.105 0.016 0.015 | 0.230 0.095
## 5 | -0.537 0.344 0.414 | -0.040 0.002 0.002 | -0.186 0.062
## 6 | -0.537 0.344 0.414 | -0.040 0.002 0.002 | -0.186 0.062
## 7 | -0.295 0.104 0.140 | -0.148 0.031 0.035 | -0.284 0.145
## 8 | -0.111 0.015 0.009 | -0.183 0.048 0.023 | -0.720 0.933
## 9 | 0.567 0.383 0.199 | -0.472 0.319 0.138 | 0.093 0.015
## 10 | 0.907 0.980 0.384 | -0.147 0.031 0.010 | -0.346 0.215
## cos2
## 1 0.054 |
## 2 0.255 |
## 3 0.129 |
## 4 0.074 |
## 5 0.050 |
## 6 0.050 |
## 7 0.129 |
## 8 0.359 |
## 9 0.005 |
## 10 0.056 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## Not.tea time | -0.505 5.684 0.198 -7.690 | 0.205 1.121 0.032
## tea time | 0.392 4.406 0.198 7.690 | -0.159 0.869 0.032
## Not.tearoom | -0.327 4.392 0.445 -11.538 | 0.046 0.104 0.009
## tearoom | 1.363 18.324 0.445 11.538 | -0.191 0.434 0.009
## black | 0.462 2.684 0.070 4.570 | 0.178 0.479 0.010
## Earl Grey | -0.114 0.425 0.023 -2.643 | -0.248 2.423 0.111
## green | -0.370 0.769 0.017 -2.250 | 1.050 7.443 0.136
## alone | -0.157 0.819 0.046 -3.704 | 0.146 0.849 0.040
## lemon | 0.516 1.492 0.033 3.134 | 0.259 0.453 0.008
## milk | -0.049 0.026 0.001 -0.439 | -0.401 2.069 0.043
## v.test Dim.3 ctr cos2 v.test
## Not.tea time 3.114 | 0.167 0.934 0.021 2.535 |
## tea time -3.114 | -0.129 0.724 0.021 -2.535 |
## Not.tearoom 1.620 | 0.016 0.016 0.001 0.572 |
## tearoom -1.620 | -0.068 0.068 0.001 -0.572 |
## black 1.760 | -0.986 18.487 0.318 -9.753 |
## Earl Grey -5.754 | 0.437 9.462 0.344 10.140 |
## green 6.383 | -0.343 1.000 0.015 -2.087 |
## alone 3.438 | -0.185 1.723 0.064 -4.368 |
## lemon 1.575 | 1.622 22.318 0.325 9.859 |
## milk -3.572 | -0.076 0.092 0.002 -0.674 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## tea.time | 0.198 0.032 0.021 |
## tearoom | 0.445 0.009 0.001 |
## Tea | 0.076 0.169 0.375 |
## How | 0.150 0.106 0.372 |
## sugar | 0.070 0.012 0.393 |
## how | 0.462 0.610 0.028 |
## where | 0.560 0.692 0.106 |
Here we can see that the dimensions 1 and 2 cover 29,915% of the variance. From the last table we can see how well different variables correlate with the first three dimensions.
plot(mca, invisible=c("ind"), habillage = "quali")
From the plot above we can see for example that buying unpackaged tea is closely related to buying tea from a tea shop. Also buying only tea bags is closely related to buying tea from a chain store. Drinking earl grey and drinking tea with lemon are closely related and also drinking black tea and drinking tea with milk.